Linear Regression
is one of the most widely used methods for examining the linear relationship between a dependent variable and one or more independent variables. The dependent variable is usually represented by \(Y\); the independent variables, of which there may be one or more, are usually represented by \(X\)s. The question we ask here is: does life expectancy depend on income level? As a measure of income level, we will be using GDP per capita income.
First, import the data: datar is the name of the imported data frame in R, and sdata.csv is the name of the CSV file being imported.
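A minimal import sketch (assuming sdata.csv is in the working directory; the exact read options are not shown in the original):
datar <- read.csv("sdata.csv")  # read the CSV file into the data frame datar
str(datar)                      # inspect the structure of the imported data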
'data.frame': 62 obs. of 3 variables:
$ Year : int 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
$ LE : num 45.2 45.4 45.7 45.9 46.2 ...
$ GDPPC: num 165 168 168 175 183 ...
head(data.frame name, n) returns the first n rows of observations of the data set, and tail(data.frame name, n) returns the last n rows.
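For example, with n = 2 (this call is a sketch; the original shows only the output):
head(datar, 2)  # first two rows of the data set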
Year LE GDPPC
1 1960 45.218 165.2733
2 1961 45.398 167.5203
\[y = \beta_0 + \beta_1\times x + u\]
We say \(y\) depends on \(x\), and the equation is read as: \(y\) is equal to \(\beta_1\) times \(x\), plus a constant \(\beta_0\), plus an error term \(u\).
When you have multiple independent variables, the equation can be written as \(y = \beta_0 + \beta_1\times x_1 + \beta_2\times x_2 + ... + \beta_n\times x_n\), where:
\(\beta_0\) is the intercept,
\(\beta_1, \beta_2, \cdots,\beta_n\) are the regression or slope coefficients associated with the predictors \(x_1, x_2, \cdots, x_n\).
\(u\) is the error term, the part of \(y\) that cannot be explained by the regression model.
In R, the linear relationship between life expectancy and income level can be estimated using the lm function. Its general form is:
lm([target/dependent var] ~ [predictor / independent var], data = [data source])
For more details, see help(lm).
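A sketch of the fit for our data (the object name model is an assumption; the formula and data are taken from the Call shown in the summary below):
model <- lm(LE ~ GDPPC, data = datar)  # regress life expectancy on GDP per capita
summary(model)                         # print the regression summary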
Call:
lm(formula = LE ~ GDPPC, data = datar)
Residuals:
Min 1Q Median 3Q Max
-8.627 -3.498 1.082 3.559 4.791
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.26149 0.93112 50.76 <2e-16 ***
GDPPC 0.02698 0.00193 13.98 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.02 on 60 degrees of freedom
Multiple R-squared: 0.7651, Adjusted R-squared: 0.7612
F-statistic: 195.4 on 1 and 60 DF, p-value: < 2.2e-16
The summary output shows six components:
Call: Shows the function call used to compute the regression model.
Residuals: Provides a quick view of the distribution of the residuals, which by definition have mean zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value.
Coefficients: Shows the regression beta coefficients and their statistical significance. Predictor variables that are significantly associated with the outcome variable are marked with stars.
Residual standard error (RSE), R-squared \((R^2)\), and the F-statistic are metrics used to check how well the model fits our data.
The first step in interpreting a simple/multiple regression analysis is to examine the F-statistic and the associated p-value, shown at the bottom of the model summary.
In our example, the p-value of the F-statistic is less than 2.2e-16, which is highly significant. This means that the predictor variable is significantly related to the outcome variable.
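The F-statistic and its p-value can also be computed directly from the summary object (a sketch, assuming the fitted object is named model as above):
fstat <- summary(model)$fstatistic                    # F value plus numerator/denominator df
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)  # p-value of the overall F-test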
Next, we examine the significance of the coefficients.
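The full-precision coefficient table shown below can be extracted with (again assuming the object name model):
summary(model)$coefficients  # estimates, standard errors, t values, and p-values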
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.26148728 0.931115969 50.75790 5.373627e-51
GDPPC 0.02697652 0.001929822 13.97876 1.565652e-20
For a given predictor, the t-statistic evaluates whether there is a significant association between the predictor and the outcome variable, that is, whether the beta coefficient of the predictor is significantly different from zero.
It can be seen that changes in income level are significantly associated with changes in life expectancy in India.
For a given predictor variable, the coefficient \(\beta\) can be interpreted as the average effect on \(y\) of a one-unit increase in the predictor \(x\).
In our example, as income (GDP per capita) increases by 100 Rs, life expectancy increases by about 2.7 years (\(100 \times 0.02698 \approx 2.7\)).
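As a quick check (a sketch; the GDPPC values 500 and 600 are chosen only for illustration):
predict(model, newdata = data.frame(GDPPC = c(500, 600)))  # fitted LE at two incomes 100 Rs apart
# the difference between the two predictions is 100 * 0.02698, i.e. about 2.7 years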
The next step is to check how good the model is, that is, how well it explains the data.
The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model summary:
Residual Standard Error (RSE),
R-squared \((R^2)\) and \(adjusted~R^2\),
F-statistic
In this example, the RSE = 4.02, meaning that the observed values deviate from the predicted values by approximately 4.02 units (years of life expectancy) on average.
In this example, the adjusted \(R^2\) is 0.7612, which is good: about 76% of the variation in life expectancy is explained by the model.
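These fit statistics can also be pulled directly from the summary object (assuming the fitted object model from above):
s <- summary(model)
s$sigma          # residual standard error (RSE)
s$r.squared      # R-squared
s$adj.r.squared  # adjusted R-squared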